Implementing the Fisher's Discriminant Ratio in a k-Means Clustering Algorithm for Feature Selection and Data Set Trimming
Abstract
Fisher's discriminant ratio is used as a class separability criterion and implemented in a k-means clustering algorithm to perform simultaneous feature selection and data set trimming on a set of 221 HIV-1 protease inhibitors. A total of 43 molecular descriptors are computed for each inhibitor and scaled to lie between 0 and 1 before being subjected to the feature selection process. Because the aim is to select the most class-sensitive descriptors, several feature evaluation indices are used to filter them: the Shannon entropy, the linear regression of selected descriptors on the pKi of selected inhibitors, and a stepwise variable selection program. The Shannon entropy provides the information content of each computed descriptor, while the linear regression and stepwise variable selection procedures search for the more class-sensitive descriptors. The inhibitors are divided into several different numbers of classes, and the five-class division is adopted because it gives the best feature selection result. Most of the good features selected are topological descriptors, and they correlate well with the pKi values. The outliers, that is, the inhibitors whose values for each selected descriptor are less class sensitive, are identified and gathered by the k-means clustering algorithm. These are the trimmed inhibitors, while the remaining ones are retained. We find that 98 inhibitors (44%) can be retained when three good descriptors are selected for clustering. The descriptor values of these retained inhibitors are far more class sensitive than the original ones, as evidenced by a substantial increase in statistical significance when they are subjected to both the SYBYL CoMFA PLS and Cerius2 PLS regression analyses.
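The abstract does not give the exact formula used, so the following is only a minimal sketch under a stated assumption: it takes the common per-feature form of Fisher's discriminant ratio (between-class scatter divided by within-class scatter) and applies it after scaling each descriptor to [0, 1]. The function names `min_max_scale` and `fisher_ratio` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def min_max_scale(X):
    """Scale each column (descriptor) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (X - lo) / span


def fisher_ratio(X, labels):
    """Per-descriptor ratio of between-class to within-class scatter.

    Assumed form (not confirmed by the source):
    sum_c n_c * (mean_c - mean)^2 / sum_c sum_{i in c} (x_i - mean_c)^2
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        between += len(Xc) * (class_mean - overall_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)  # guard against zero scatter


# Toy data (illustrative only): one class-sensitive column, one noise column.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
informative = labels + rng.normal(scale=0.1, size=60)
noise = rng.normal(size=60)

X = min_max_scale(np.column_stack([informative, noise]))
scores = fisher_ratio(X, labels)
ranking = np.argsort(scores)[::-1]  # descriptor indices, most class sensitive first
```

In this sketch a higher score marks a descriptor whose class means are well separated relative to the spread inside each class, which is the sense in which a descriptor is called class sensitive above. How the paper combines this ratio with the k-means step is not specified in the abstract and is not modeled here.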
Journal title:
- Journal of chemical information and computer sciences
Volume 44, Issue 1
Pages -
Publication date 2004